Annotating Article Errors in Spanish Learner Texts: Design and Evaluation of an Annotation Scheme
Abstract
Annotating a corpus with error information is a challenging task. This paper describes the design, evaluation and refinement of an annotation scheme for Spanish article errors in learner data, intended to support future work on corpus annotation and automatic article error detection. To evaluate reliability, four annotators tagged 300 noun phrases containing definite, indefinite and zero articles. We analysed the different types of disagreement, suggested ways to increase reliability, and applied the refined annotation scheme to create a gold-standard annotation.
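The abstract does not say which agreement measure was used in the reliability study. As an illustration only, a common choice when more than two annotators label the same items is Fleiss' kappa. The sketch below assumes each noun phrase receives exactly one label from every annotator; the function name, the label names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each a list of labels (one per annotator).

    Assumes every item was labelled by the same number of annotators
    (at least two) and that every label appears in `categories`.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][c] = number of annotators who assigned category c to item i
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the proportion of agreeing annotator pairs
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Expected agreement: sum of squared overall category proportions
    p_cats = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in p_cats)

    if p_expected == 1.0:
        # Only one category was ever used; kappa is undefined, report full agreement
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Hypothetical toy data: 3 noun phrases, 4 annotators each
    labels = ["definite", "indefinite", "zero"]
    toy = [
        ["definite", "definite", "definite", "zero"],
        ["indefinite", "indefinite", "indefinite", "indefinite"],
        ["zero", "zero", "definite", "zero"],
    ]
    print(fleiss_kappa(toy, labels))
```

With 300 noun phrases and four annotators, `ratings` would hold 300 inner lists of four labels each. Per-category or pairwise measures may be more informative when analysing specific types of disagreement.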
Similar resources
Annotating foreign learners’ Czech
One of the challenges of contemporary corpus linguistics is the compilation and annotation of corpora consisting of texts produced by non-native speakers. In addition to morphosyntactic tagging and lemmatisation, such texts can be annotated by information relevant to the specific nonstandard use. Cases of deviant language use can be corrected and identified by a tag specifying the type of the e...
Building a learner corpus
The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked levels to cope with a wide range of error types present in the input. Each level corrects different types of errors; links between the levels allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, bu...
Annotating Errors in a Hungarian Learner Corpus
We are developing and annotating a learner corpus of Hungarian, composed of student journals from three different proficiency levels written at Indiana University. Our annotation marks learner errors that are of different linguistic categories, including phonology, morphology, and syntax, but defining the annotation for an agglutinative language presents several issues. First, we must adapt an ...
Evaluating and automating the annotation of a learner corpus
The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked tiers, designed to handle a wide range of error types present in the input. Each tier corrects different types of errors; links between the tiers allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected...
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...